Supporting Parallelism in Server-based Multiprocessor Systems 



Luis Nogueira, Luis Miguel Pinho 
CISTER Research Centre 
School of Engineering (ISEP), Polytechnic Institute of Porto (IPP), Portugal 

{lmn,lmp}@isep.ipp.pt






Abstract 

Developing an efficient server-based real-time 
scheduling solution that supports dynamic task-level 
parallelism is now relevant to even the desktop and 
embedded domains and no longer only to the high 
performance computing market niche. This paper 
proposes a novel approach that combines the constant- 
bandwidth server abstraction with a work-stealing 
load balancing scheme which, while ensuring isolation 
among tasks, enables a task to be executed on more 
than one processor at a given time instant. 



1 Introduction 

The constant-bandwidth server abstraction has 
proved very useful in designing, implementing, and rea- 
soning about single core open real-time systems where 
tasks can dynamically enter or leave the system at any 
time [T2]. Each task is assigned a fraction of the com- 
putational resources and it is handled by an abstract 
entity called server to achieve the goals of temporal 
isolation and real-time execution pQ. 

However, modern open real-time systems increas- 
ingly generate heavy workloads and it is rapidly be- 
coming unreasonable to expect to implement them 
as single core systems. In fact, a general shift from 
unicore to multicore processors can be seen both 
in the general purpose and embedded domains as 
an energy-efficient way to boost applications' perfor- 
mance. Therefore, there have been significant efforts 
to extend reservation-based real-time scheduling theory
to make it applicable to multiprocessor systems as well.

Nevertheless, all these works consider task models 
where tasks use at most a single core at each time 
instant. This restriction is natural for uniprocessor 
scheduling since only one processor is available at any 
time, even if we deal with parallel algorithms. However,
the need for parallel processing, i.e. the simultaneous use of
several processors by an individual task, is steadily increasing,
even in the desktop and embedded domains, and is no longer
confined to the comparatively small high-performance computing
market niche. Therefore, to fully utilise the parallel
capabilities of multicore platforms, we should be able to
support tasks that may be executed on different cores at the
same time instant.

There are many computations that can be relatively 
easily parallelised by using frameworks such as Cilk 
[5], Intel's Parallel Building Blocks [5], Java Fork-join 
Framework [TT], Microsoft's Task Parallel Library [BJ, 
or StackThreads/MP [T5\. These frameworks encour- 
age application developers to create many more paral- 
lel jobs (hereafter called pjobs) than there are available 
CPUs. The division of work among pjobs is often imperfect,
and the system must provide an efficient runtime
that maps ready pjobs to processors,
thus dynamically balancing the workload.

One of the simplest, yet best-performing, dynamic 
load-balancing algorithms for shared-memory architec- 
tures is work-stealing [3]. Blumofe and Leiserson have 
theoretically proven that the work-stealing algorithm 
is optimal for scheduling fully-strict computations [4]. 
Under this assumption, an application running on P 
processors achieves P-fold speedup in its parallel part, 
using at most P times more space than when running 
on one CPU. These results are also supported by ex-
periments [14].

This paper discusses the general guidelines of a novel 
scheduling approach for parallel runtimes that will co- 
exist with a wide range of other complex independently 
developed applications, without any previous knowl- 
edge about their real execution requirements, num- 
ber of pjobs, and when those pjobs will be generated. 
Schedulers in this type of open system are therefore
required to maintain a certain (quantifiable) level
of service for each application, with the exact guarantee
depending upon the CPU reservation's parameters.
The proposed approach combines a work-stealing
policy with multiprocessor constant-bandwidth servers
which, while ensuring isolation among tasks, allows a
task to be executed on more than one processor at a
given time. To the best of our knowledge, no previous
research has focused on this subject.

2 System model 

We consider the scheduling of sporadic independent
tasks on $m$ identical processors $p_1, p_2, \ldots, p_m$ using
global EDF. With global EDF, each task ready to execute
is placed in a system-wide queue, ordered by non-decreasing
absolute deadline, from which the first $m$
tasks are extracted to execute on the available processors.
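
As a rough illustration of the global EDF selection described above, the following Python sketch picks the $m$ earliest-deadline ready tasks. The names `ReadyTask` and `pick_for_execution` are hypothetical and chosen only for this example; the sketch ignores servers, preemption, and eligibility.

```python
import heapq
from dataclasses import dataclass, field


@dataclass(order=True)
class ReadyTask:
    deadline: float                       # absolute deadline, the EDF key
    name: str = field(compare=False)      # label only, not used for ordering


def pick_for_execution(ready, m):
    """Return up to m ready tasks with the earliest absolute deadlines."""
    return heapq.nsmallest(m, ready)


# Hypothetical ready set on a platform with m = 2 processors.
ready = [ReadyTask(12.0, "t1"), ReadyTask(7.0, "t2"),
         ReadyTask(20.0, "t3"), ReadyTask(9.0, "t4")]
running = pick_for_execution(ready, m=2)
print([t.name for t in running])
```

With the deadlines assumed here, the two tasks selected are `t2` and `t4`, in that order.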

A pool of worker threads is established. Assume that 
there will be as many worker threads as there are CPUs 
on a system. A special purpose accounting discipline is 
used to manage tasks and execute them via the worker 
threads. 

Each task $\tau_i$ can generate a virtually infinite sequence
of jobs. The arrival time $a_{i,j}$ of the $j$-th job
of a task $\tau_i$ is only revealed at runtime and the exact
execution requirements $e_{i,j}$ can only be determined by
actually executing the job to completion until time $f_{i,j}$.
All jobs generated by a task $\tau_i$ are dedicated to a p-CSS
server $S_i$, an extension for the multicore case of
the Capacity Sharing and Stealing scheduler [12]. Each
server $S_i$ is characterised by a pair $(Q_i, T_i)$, where $Q_i$
is the server's maximum reserved capacity and $T_i$ its
period. The ratio $U_i = Q_i / T_i$ denotes the fraction of the
capacity of one processor that is assigned to the server.
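
To make the server parameters concrete, the sketch below computes the bandwidth (capacity divided by period) of a few servers. The class name `Server`, its fields, and the numeric values are assumptions made for illustration; summing the bandwidths is shown only as arithmetic and is not a schedulability test taken from this paper.

```python
from dataclasses import dataclass
from fractions import Fraction


@dataclass(frozen=True)
class Server:
    Q: int   # maximum reserved capacity per period (time units)
    T: int   # period (same time units)

    @property
    def U(self):
        """Fraction of one processor reserved for this server."""
        return Fraction(self.Q, self.T)


# Hypothetical reservations: (2, 10), (30, 100) and (5, 20).
servers = [Server(2, 10), Server(30, 100), Server(5, 20)]
total = sum(s.U for s in servers)
print([str(s.U) for s in servers], str(total))
```

Exact fractions are used instead of floats so that bandwidths such as 3/10 are compared without rounding error.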

At each instant, the following values are associated
with a server $S_i$: its currently assigned deadline $d_k^i$, its
remaining execution capacity $0 \le c_k^i \le Q_i$, the amount
of residual capacity $r_k^i \le c_k^i$ that can be reclaimed by
other servers, and its currently assigned replenishment
time $h_k^i = d_k^i$. If at time $t$, $S_i$ finishes the execution of
its currently served job without exhausting its reserved
execution capacity $c_k^i$ and it has no pending work, the
remaining amount $c_k^i > 0$ sets the server's residual capacity
$r_k^i = c_k^i$ that can be reclaimed by other servers
($c_k^i$ is subsequently set to zero). By pending work we
refer to the case when there exists at least one served job
such that its release time is no later than $t$ and $t < f_{i,j}$.

During the course of its execution a job can
spawn, at any time, a set of parallel jobs
$\{pjob_{i,1}, pjob_{i,2}, \ldots, pjob_{i,n}\}$, sequential pieces of work
that can be executed on different processors at the
same time instant using the available execution capacity
of their corresponding task. For now, our work is
focused on systems where all pjobs are fully independent,
i.e., except for the $m$ cores there are no other
shared resources, no critical sections, and no precedence
constraints.



Contrary to regular jobs of a task, pjobs are not
pushed to the global EDF queue but are instead maintained
in a worker's local work-stealing double-ended
queue (deque) to reduce contention on the global
queue. Any pjob in the work-stealing queue can be
shared with any other worker thread. A worker thread
first looks into its local queue. If there is no pjob to
pick, it then searches the global EDF queue. If
there is no eligible job¹ in the global EDF queue either,
the worker will steal the earliest-deadline eligible pjob
from the top of another busy worker's deque. For a busy
worker, pjobs are pushed and popped from the bottom
of the deque and these operations are synchronisation-free.
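
The lookup order described above (local deque, then global EDF queue, then stealing) can be sketched as follows. This is a single-threaded illustration under stated assumptions: work items are `(deadline, name)` tuples, every item is treated as eligible, the left end of a `collections.deque` stands for the top and the right end for the bottom, and the function name `next_work` is hypothetical.

```python
import heapq
from collections import deque


def next_work(worker, deques, global_edf):
    """Pick the next item for `worker`; returns (source, item) or None."""
    local = deques[worker]
    if local:                                  # 1. own deque, bottom end
        return "local", local.pop()
    if global_edf:                             # 2. global EDF queue (a heap)
        return "global", heapq.heappop(global_edf)
    # 3. steal the earliest-deadline pjob among the tops of busy deques
    victims = [w for w, d in deques.items() if w != worker and d]
    if not victims:
        return None
    victim = min(victims, key=lambda w: deques[w][0][0])
    return "stolen", deques[victim].popleft()


# Worker 0 is idle, the global queue is empty, workers 1 and 2 hold pjobs.
deques = {0: deque(),
          1: deque([(30, "a"), (25, "b")]),
          2: deque([(18, "c"), (40, "d")])}
print(next_work(0, deques, []))
```

A real runtime would need a concurrent deque so that the owner's bottom-end operations stay synchronisation-free while thieves access the top; that aspect is deliberately left out here.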

3 Multicore Capacity Sharing and 
Stealing 

In this paper, we consider a periodic task model in 
which jobs may spawn a set of parallel jobs, indepen- 
dent sequential pieces of code that may have different 
execution costs but a common period. Multithreaded 
jobs such as this arise naturally in many settings. For 
example, in multimedia applications, multiple threads 
may be useful for performing different functions on 
common data (e.g., a frame of an MPEG video) at the 
same rate. Our goal is to find an efficient scheduling 
framework for these parallel runtimes while ensuring 
temporal isolation among applications and guarantee- 
ing a certain degree of service to each individual appli- 
cation in open real-time systems. 

Since our management of reserved capacities is 
based on our previous work for uniprocessor systems 
[12], we will start by describing the capacity shar-
ing and stealing approach of CSS. CSS extends CBS
P] with a powerful strategy that supports the coex- 
istence of guaranteed (isolated) and non- guaranteed 
(non-isolated) bandwidth servers to efficiently han- 
dle soft-tasks' overloads by making additional capacity 
available from two sources: (i) reclaiming unused re- 
served capacity when jobs complete in less than their 
budgeted execution time and (ii) stealing reserved ca- 
pacity from inactive non-isolated servers used to sched- 
ule best-effort jobs. 

¹ An eligible job $J_{i,j}$ is one whose dedicated server $S_i$ is
able to execute $J_{i,j}$ by either consuming its own reserved capacity
$c_k^i > 0$, reclaiming any available residual capacity greater than
zero with deadline $\le d_k^i$, or stealing a non-isolated capacity
with deadline

Whenever a job is being executed, the capacity it consumes
must be decreased by the same amount. By dynamically
managing a pointer to the server from which the capacity
is going to be decreased, the proposed dynamic accounting
mechanism of CSS eliminates the need for extra queues or
additional server states, reducing its overhead. The server
from which the accounting is going to be performed is
dynamically determined at the time instant when a capacity
is needed. CSS uses the following rules to manage reserved
capacities:

• Rule A (residual capacity release): Whenever
a server $S_i$ completes the $k$-th job of its associated
task $\tau_i$ and it has no pending work, its remaining
reserved capacity $c_k^i > 0$ is released as residual capacity
$r_k^i = c_k^i$, and $c_k^i$ is set to zero. The released
residual capacity $r_k^i$ can immediately be reclaimed
by eligible active servers until the currently assigned
deadline $d_k^i$ of $S_i$. $S_i$ is kept active with its
current deadline until its residual capacity $r_k^i$ is
exhausted by other servers.

• Rule B (residual capacity reclaim): The next
active server $S_i$ scheduled for execution points to
the earliest-deadline server $S_{edf}$ from the set of
eligible active servers $A_r$ for capacity reclaiming. $S_i$
consumes the pointed residual capacity, running
with the deadline of the pointed server $S_{edf}$.
Whenever that residual capacity is exhausted and there is
pending work, $S_i$ disconnects from $S_{edf}$ and
selects the next available server $S'_{edf}$ (if any).

• Rule C (dedicated capacity consumption):
If all eligible residual capacities are exhausted and
the current $k$-th job of server $S_i$ is not yet completed,
$S_i$ consumes its own reserved capacity $c_k^i$
either until the job's completion or until $c_k^i$ is
exhausted (whichever comes first). If $c_k^i$ is exhausted and
there is still pending work to do, $S_i$ is kept active
with its current deadline $d_k^i$.

• Rule D (inactive non-isolated capacity
steal): A server $S_i$ with pending work and no
available execution capacity ($c_k^i = 0$) connects to
the earliest-deadline server $S_{edf}$ from the set of
eligible inactive non-isolated servers $I_s$. $S_i$ steals
the pointed inactive capacity, running with its
current deadline. Whenever the pointed capacity is exhausted
and the job has not yet been completed, the next
non-isolated capacity is used (if any).
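
The order in which Rules A to D select a capacity to charge can be sketched for the uniprocessor case as follows. This is a simplified illustration, not the CSS implementation: the `Server` fields and the functions `release_residual` and `capacity_source` are hypothetical names, the reclaim eligibility test (residual deadline no later than the reclaiming server's deadline) follows our reading of the eligibility footnote, and the stealing eligibility test is reduced to "inactive, non-isolated, capacity left" as an assumption.

```python
from dataclasses import dataclass


@dataclass
class Server:
    name: str
    deadline: float         # currently assigned deadline
    c: float = 0.0          # remaining reserved capacity
    r: float = 0.0          # residual capacity released under Rule A
    active: bool = True
    isolated: bool = True


def release_residual(s):
    """Rule A: job done, no pending work, so remaining c becomes residual."""
    s.r, s.c = s.c, 0.0


def capacity_source(si, servers):
    """Return (kind, server) naming the capacity that si is charged against."""
    # Rule B: earliest-deadline active server with reclaimable residual capacity.
    reclaimable = [s for s in servers
                   if s.active and s.r > 0 and s.deadline <= si.deadline]
    if reclaimable:
        return "residual", min(reclaimable, key=lambda s: s.deadline)
    # Rule C: fall back on the server's own reserved capacity.
    if si.c > 0:
        return "own", si
    # Rule D: steal from the earliest-deadline inactive non-isolated server.
    stealable = [s for s in servers
                 if not s.active and not s.isolated and s.c > 0]
    if stealable:
        return "steal", min(stealable, key=lambda s: s.deadline)
    return None


s1 = Server("S1", deadline=10, c=3)
s2 = Server("S2", deadline=15, c=2)
s3 = Server("S3", deadline=40, c=4, active=False, isolated=False)
release_residual(s1)                    # S1 finished early
print(capacity_source(s2, [s1, s2, s3])[0])
```

Returning the server to charge, rather than moving capacity between servers, mirrors the pointer-based accounting described in the text: only the target of the accounting changes, not the set of queues or server states.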

We are currently investigating how these rules can
be extended to multicore platforms. Due to well-known
multiprocessor scheduling anomalies [2], the extension
is not trivial: adopting the same rules as in the
uniprocessor case would lead to deadline violations even
when the considered task set is schedulable
under a global EDF scheduler. A possible approach
for reclaiming residual capacities has been proposed
in M-CASH [13] (residual capacities are equally
distributed across all processors, including idle ones),
but no work is known for handling capacity stealing in
the multicore case.

4 Work stealing in the presence of jobs' 
priorities 

Work-stealing schedulers are increasing in popular- 
ity as scheduling algorithms for dynamic task paral- 
lelism. A work-stealing scheduler employs a fixed num- 
ber of threads called workers. Each of those workers 
has a local deque to store tasks. Whenever a worker
has no local tasks to execute, it will try to steal a task
from the top of another busy worker's deque. Thus, it
must choose which processor to steal from and which
task to take. These choices lead to variations of
the work-stealing algorithm and are the main issue of
this section.

Blumofe and Leiserson [4] demonstrate that a random
choice of the processor to steal from is fair. Furthermore,
random choices have the advantage that choosing
the target requires no more information than
the total number of processors in the execution platform.
The thread then steals a task from the back of
the run-queue of the randomly chosen thread. There are
several reasons for accessing the run queues at different
ends: (i) it reduces contention by having stealing
threads operate on the opposite end of the queue
from the thread they are stealing from; (ii) it works
better for parallelised divide-and-conquer algorithms,
which typically generate large chunks of work early, so
the older stolen task is likely to provide further
work to the stealing thread; and (iii) stealing a task also
migrates its future workload, which helps to increase
locality. All queue manipulations run in constant time
($O(1)$), independently of the number of tasks in the
queues.
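
A minimal, single-threaded sketch of the classic randomised scheme just described is given below. The class `Worker` and the function `random_steal` are hypothetical names; the sketch is not thread-safe and only shows which end of the deque each party uses, whereas real runtimes rely on concurrent deque protocols.

```python
import random
from collections import deque


class Worker:
    def __init__(self):
        # Left end = top (oldest task), right end = bottom (newest task).
        self.tasks = deque()

    def push(self, task):
        """Owner adds new work at the bottom."""
        self.tasks.append(task)

    def pop(self):
        """Owner takes work from the bottom (LIFO); None if empty."""
        return self.tasks.pop() if self.tasks else None

    def steal_from(self, victim):
        """Thief takes the oldest task from the victim's top; None if empty."""
        return victim.tasks.popleft() if victim.tasks else None


def random_steal(thief, workers, rng=random):
    """Pick a victim uniformly at random among the other workers."""
    others = [w for w in workers if w is not thief]
    return thief.steal_from(rng.choice(others))


w0, w1 = Worker(), Worker()
for t in ("a", "b", "c"):
    w0.push(t)
print(w0.pop(), random_steal(w1, [w0, w1]))
```

Both `append`/`pop` and `popleft` on a `collections.deque` take constant time, which matches the constant-time queue manipulations mentioned above. A randomly chosen victim may be empty, in which case the steal attempt returns `None` and the thief would simply try again.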

However, the algorithm does not take tasks' dead- 
lines into account when stealing a task from another 
worker. Also, a task must voluntarily yield or block 
before another task can be scheduled on the same CPU 
or otherwise it will run to completion. Thus, the need 
to support tasks' deadlines and CPU reservations fun- 
damentally distinguishes the problem at hand in this 
paper from other work-stealing choices previously pro- 
posed in the literature. 

Our proposal is to apply work-stealing to enable
multithreaded jobs to be executed on more than one
processor. Recall that a job $J_{i,k}$, assigned to a particular
worker thread by a global EDF policy, can spawn
a set of pjobs at any time during the course of its execution.
Pjobs are not pushed to the global EDF queue
but are instead pushed to the bottom of the worker's local
deque. Pjobs are dedicated to the same server $S_i$ as job
$J_{i,k}$, ensuring isolation among tasks.

Then, while there are pending pjobs on the local 
deque, the worker should purposefully select the bot- 
tommost pjob, which is the pjob with the highest prob- 
ability of still being in the cache. Hence, there are per- 
formance improvements in processing local queues in a 
LIFO order. Note that in this case the job is executed
sequentially, as if there were no support for parallel
execution of pjobs. On the other hand, work stealing
allows an idle worker thread to perform some of
the pjobs in another overloaded processor's queue. Thus,
whenever a worker thread has no pending pjobs in its
local deque and the first job on the global EDF queue
has a later deadline than at least one of the eligible
pjobs at the top of the other workers' deques, the
worker thread should steal the earliest-deadline eligible
pjob among the topmost pjobs on the other workers'
deques.
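
The deadline-based steal condition stated above can be sketched as a small decision function. As before, this is an illustration under assumptions: items are `(deadline, name)` tuples with the top of each deque at index 0, all pjobs are treated as eligible, an empty global queue is represented by `None` and is assumed to permit stealing, and the name `choose_victim` is hypothetical.

```python
from collections import deque


def choose_victim(local, global_head_deadline, other_deques):
    """Return the deque to steal from, or None if no steal should happen."""
    if local:                                   # local pjobs always come first
        return None
    busy = [d for d in other_deques if d]
    if not busy:
        return None
    # Earliest deadline among the topmost pjobs of the other workers.
    victim = min(busy, key=lambda d: d[0][0])
    if global_head_deadline is None or global_head_deadline > victim[0][0]:
        return victim
    return None                                 # global EDF head is more urgent


others = [deque([(15, "p1"), (50, "p2")]), deque([(30, "p3")])]
victim = choose_victim(deque(), 20, others)
print(victim.popleft() if victim is not None else None)
```

When the function returns `None` with an empty local deque, the worker would take the head of the global EDF queue instead, since it has the earlier deadline.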

We believe that this deadline-based work-stealing
policy will increase the speedup of parallel
applications without jeopardising the schedulability of
the other sequential jobs scheduled by global EDF. We
are currently investigating this claim.

5 Conclusions and future work 

This paper discussed the increased need to support 
dynamic task-level parallelism in open real-time sys- 
tems and proposed the general guidelines of a novel 
scheduling approach that combines a work-stealing 
load balancing policy with a multicore reservation- 
based approach. 

Our current efforts are focused on a theoretical val- 
idation of the proposed approach. It is our belief that 
the ideas discussed here will improve the execution efficiency
of parallel tasks while continuing to achieve isolation
among tasks whose resource demands are only
known at runtime. We plan to evaluate the efficiency of
the approach in real-world scenarios by implementing
it on top of SCHED_DEADLINE [5], a patch for
the Linux kernel that implements an EDF scheduler
with a CPU reservation mechanism based on CBS.